DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues

نویسندگان

  • Xin Ma
  • Jing Guo
  • Xiao Sun
چکیده

DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

MOTIVATION In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter value...

متن کامل

Predicting DNA-Binding Sites by Exploring the Distribution of Atom Groups around the Surface

DNA-binding proteins perform various functions in the cells. Determining the structures of protein-DNA complexes using experimental methods are hindered by many obstacles. Thus, computational methods for predicting DNAbinding sites on protein structures are needed to elucidate the mechanism of protein-DNA interactions. In this study, we divided atoms of amino acid residues into 14 groups and us...

متن کامل

A Novel Sequence-Based Feature for the Identification of DNA-Binding Sites in Proteins Using Jensen-Shannon Divergence

The knowledge of protein-DNA interactions is essential to fully understand the molecular activities of life. Many research groups have developed various tools which are either structureor sequence-based approaches to predict the DNA-binding residues in proteins. The structure-based methods usually achieve good results, but require the knowledge of the 3D structure of protein; while sequence-bas...

متن کامل

Predicting a DNA-binding protein using random forest with multiple mathematical features.

DNA-binding proteins are involved and play a crucial role in a lot of important biological processes. Hence, the identification of the DNA-binding proteins is a challenging and significant problem. In order to reveal the intrinsic information correlated to DNA-binding, nine classes of candidate features based on different mathematical fields are applied to construct the prediction model with ra...

متن کامل

Identification of DNA-Binding Proteins Using Support Vector Machine with Sequence Information

DNA-binding proteins are fundamentally important in understanding cellular processes. Thus, the identification of DNA-binding proteins has the particularly important practical application in various fields, such as drug design. We have proposed a novel approach method for predicting DNA-binding proteins using only sequence information. The prediction model developed in this study is constructed...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره 11  شماره 

صفحات  -

تاریخ انتشار 2016